Predict Credit Card Defaulters

Problem Statement

In this case study, we are going to build a classifier to estimate the probability of a customer defaulting on their credit card bills.

Dataset

credit-default.csv

Dataset Description:

Each row describes a customer. We have details about their savings, employment, age, marital status, etc. In the default column (the target column), the value is 1 if the customer has not defaulted and 2 if the customer has defaulted.

Importing the Required Libraries

In [28]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, auc, roc_curve
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
In [2]:
pd.set_option('display.max_rows', 100000)
pd.set_option('display.max_columns', 500000)
In [3]:
dt=pd.read_csv('credit-default.csv')
In [4]:
dt.head()
Out[4]:
checking_balance months_loan_duration credit_history purpose amount savings_balance employment_length installment_rate personal_status other_debtors residence_history property age installment_plan housing existing_credits default dependents telephone foreign_worker job
0 < 0 DM 6 critical radio/tv 1169 unknown > 7 yrs 4 single male none 4 real estate 67 none own 2 1 1 yes yes skilled employee
1 1 - 200 DM 48 repaid radio/tv 5951 < 100 DM 1 - 4 yrs 2 female none 2 real estate 22 none own 1 2 1 none yes skilled employee
2 unknown 12 critical education 2096 < 100 DM 4 - 7 yrs 2 single male none 3 real estate 49 none own 1 1 2 none yes unskilled resident
3 < 0 DM 42 repaid furniture 7882 < 100 DM 4 - 7 yrs 2 single male guarantor 4 building society savings 45 none for free 1 1 2 none yes skilled employee
4 < 0 DM 24 delayed car (new) 4870 < 100 DM 1 - 4 yrs 3 single male none 4 unknown/none 53 none for free 2 2 2 none yes skilled employee
In [5]:
dt.columns
Out[5]:
Index(['checking_balance', 'months_loan_duration', 'credit_history', 'purpose',
       'amount', 'savings_balance', 'employment_length', 'installment_rate',
       'personal_status', 'other_debtors', 'residence_history', 'property',
       'age', 'installment_plan', 'housing', 'existing_credits', 'default',
       'dependents', 'telephone', 'foreign_worker', 'job'],
      dtype='object')
In [6]:
dt.shape
Out[6]:
(1000, 21)
In [7]:
dt.describe()
Out[7]:
months_loan_duration amount installment_rate residence_history age existing_credits default dependents
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 20.903000 3271.258000 2.973000 2.845000 35.546000 1.407000 1.300000 1.155000
std 12.058814 2822.736876 1.118715 1.103718 11.375469 0.577654 0.458487 0.362086
min 4.000000 250.000000 1.000000 1.000000 19.000000 1.000000 1.000000 1.000000
25% 12.000000 1365.500000 2.000000 2.000000 27.000000 1.000000 1.000000 1.000000
50% 18.000000 2319.500000 3.000000 3.000000 33.000000 1.000000 1.000000 1.000000
75% 24.000000 3972.250000 4.000000 4.000000 42.000000 2.000000 2.000000 1.000000
max 72.000000 18424.000000 4.000000 4.000000 75.000000 4.000000 2.000000 2.000000
In [8]:
dt.isnull().sum()
Out[8]:
checking_balance        0
months_loan_duration    0
credit_history          0
purpose                 0
amount                  0
savings_balance         0
employment_length       0
installment_rate        0
personal_status         0
other_debtors           0
residence_history       0
property                0
age                     0
installment_plan        0
housing                 0
existing_credits        0
default                 0
dependents              0
telephone               0
foreign_worker          0
job                     0
dtype: int64
In [9]:
dt.dtypes
Out[9]:
checking_balance        object
months_loan_duration     int64
credit_history          object
purpose                 object
amount                   int64
savings_balance         object
employment_length       object
installment_rate         int64
personal_status         object
other_debtors           object
residence_history        int64
property                object
age                      int64
installment_plan        object
housing                 object
existing_credits         int64
default                  int64
dependents               int64
telephone               object
foreign_worker          object
job                     object
dtype: object
In [12]:
dt['default'] = dt['default'].replace({1:0, 2:1})

Feature-Engineering

In [13]:
plt.figure(figsize=(20,20))
sns.heatmap(dt.corr(),annot=True,cmap='coolwarm')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e458d060b8>
In [14]:
def histogram(data,path,color,title,xaxis,yaxis):
    fig = px.histogram(data, x=path,color=color)
    fig.update_layout(
        title_text=title,
        xaxis_title_text=xaxis, 
        yaxis_title_text=yaxis, 
        bargap=0.2, 
        bargroupgap=0.1
    )
    fig.show()
In [15]:
dt_new=dt[dt['default']==0]
In [16]:
dt_new=dt_new.groupby('default')['credit_history'].value_counts(normalize=True)
dt_new = dt_new.mul(100).rename('Percent').reset_index()
dt_new['Percent']=dt_new['Percent'].round(decimals=2)
In [17]:
dt_new
Out[17]:
default credit_history Percent
0 0 repaid 51.57
1 0 critical 34.71
2 0 delayed 8.57
3 0 fully repaid this bank 3.00
4 0 fully repaid 2.14
In [18]:
px.bar(dt_new, x='default', y='Percent', color='credit_history',title="Default as '0' w.r.t Credit History" 
                    ,barmode='group', text='Percent')

Observation

  • This graph shows the percentage of each credit history category for customers with default value 0.
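The groupby/percentage pattern used above is repeated for several other features later in the notebook, so it could be wrapped in a small helper. A minimal sketch (the function name `percent_by` and the tiny demo frame are illustrative, not from the notebook):

```python
import pandas as pd

def percent_by(df, target, feature):
    """Share of each `feature` category within each `target` group, in percent."""
    out = (df.groupby(target)[feature]
             .value_counts(normalize=True)
             .mul(100)
             .rename('Percent')
             .reset_index())
    out['Percent'] = out['Percent'].round(2)
    return out

# Tiny illustrative frame (not the credit data)
demo = pd.DataFrame({'default': [0, 0, 0, 1],
                     'credit_history': ['repaid', 'repaid', 'critical', 'delayed']})
print(percent_by(demo, 'default', 'credit_history'))
```

The same helper would then serve every "Default as '0'/'1' w.r.t ..." chart with a one-line call.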
In [19]:
dt_new=dt[dt['default']==1]
dt_new=dt_new.groupby('default')['credit_history'].value_counts(normalize=True)
dt_new = dt_new.mul(100).rename('Percent').reset_index()
dt_new['Percent']=dt_new['Percent'].round(decimals=2)
In [20]:
px.bar(dt_new, x='default', y='Percent', color='credit_history',title="Default as '1' w.r.t Credit History" 
                    ,barmode='group', text='Percent')

Observation

  • This graph shows the percentage of each credit history category for customers with default value 1.
In [21]:
histogram(dt,"credit_history","default",'Default on Credit History','Credit History','Count')

Observation

  • This graph shows the count of each credit history category split by default value (0 and 1).
In [22]:
histogram(dt,"age","default",'Age count on default','Age Distribution','Count')

Observation

  • This graph shows the age distribution split by default value (0 and 1).
In [23]:
histogram(dt,"months_loan_duration","default",'Months loan count on default','Months Distribution','Count')

Observation

  • This graph shows the loan duration (in months) distribution split by default value (0 and 1).
In [25]:
dt_new=dt[dt['default']==1]
dt_new=dt_new.groupby('default')['purpose'].value_counts(normalize=True)
dt_new = dt_new.mul(100).rename('Percent').reset_index()
dt_new['Percent']=dt_new['Percent'].round(decimals=2)
In [26]:
px.bar(dt_new, x='default', y='Percent', color='purpose',title="Default as '1' w.r.t purpose" 
                    ,barmode='group', text='Percent')

Observation

  • This graph shows the loan purpose distribution for customers with default value 1.
In [27]:
dt_new=dt[dt['default']==0]
dt_new=dt_new.groupby('default')['purpose'].value_counts(normalize=True)
dt_new = dt_new.mul(100).rename('Percent').reset_index()
dt_new['Percent']=dt_new['Percent'].round(decimals=2)
In [28]:
px.bar(dt_new, x='default', y='Percent', color='purpose',title="Default as '0' w.r.t purpose" 
                    ,barmode='group', text='Percent')

Observation

  • This graph shows the loan purpose distribution for customers with default value 0.
In [29]:
histogram(dt,"purpose","default",'Default on purpose','Purpose','Count')

Observation

  • This graph shows the loan purpose distribution split by default value (1 & 0)
In [30]:
histogram(dt,"amount","default",'Default on amount','Amount','Count')

Observation

  • This graph shows the default value (1 & 0) with respect to the credit amount
In [32]:
dt_new=dt[dt['default']==1]
dt_new=dt_new.groupby('default')['savings_balance'].value_counts(normalize=True)
dt_new = dt_new.mul(100).rename('Percent').reset_index()
dt_new['Percent']=dt_new['Percent'].round(decimals=2)
In [33]:
px.bar(dt_new, x='default', y='Percent', color='savings_balance',title="Default as '1' w.r.t savings balance" 
                    ,barmode='group', text='Percent')

Observation

  • This graph shows the savings balance distribution for customers with default value 1
In [34]:
dt_new=dt[dt['default']==0]
dt_new=dt_new.groupby('default')['savings_balance'].value_counts(normalize=True)
dt_new = dt_new.mul(100).rename('Percent').reset_index()
dt_new['Percent']=dt_new['Percent'].round(decimals=2)
In [35]:
px.bar(dt_new, x='default', y='Percent', color='savings_balance',title="Default as '0' w.r.t savings balance" 
                    ,barmode='group', text='Percent')

Observation

  • This graph shows the savings balance distribution for customers with default value 0
In [36]:
histogram(dt,"savings_balance","default",'Default on savings balance','savings_balance','Count')

Observation

  • This graph shows the savings balance distribution split by default value (1 & 0)
In [38]:
histogram(dt,"employment_length","default",'Default on employment length','employment_length','Count')

Observation

  • This graph shows the employment length distribution split by default value (1 & 0)
In [40]:
histogram(dt,"installment_rate","default",'Default on installment_rate','installment_rate','Count')

Observation

  • This graph shows the installment rate distribution split by default value (1 & 0)
In [42]:
dt_new=dt[dt['default']==1]
dt_new=dt_new.groupby('default')['personal_status'].value_counts(normalize=True)
dt_new = dt_new.mul(100).rename('Percent').reset_index()
dt_new['Percent']=dt_new['Percent'].round(decimals=2)
In [43]:
px.bar(dt_new, x='default', y='Percent', color='personal_status',title="Default as '1' w.r.t personal status" 
                    ,barmode='group', text='Percent')

Observation

  • This graph shows the personal status distribution for customers with default value 1
In [44]:
dt_new=dt[dt['default']==0]
dt_new=dt_new.groupby('default')['personal_status'].value_counts(normalize=True)
dt_new = dt_new.mul(100).rename('Percent').reset_index()
dt_new['Percent']=dt_new['Percent'].round(decimals=2)
In [45]:
px.bar(dt_new, x='default', y='Percent', color='personal_status',title="Default as '0' w.r.t personal status" 
                    ,barmode='group', text='Percent')

Observation

  • This graph shows the personal status distribution for customers with default value 0
In [46]:
histogram(dt,"personal_status","default",'Default on personal status','personal status','Count')

Observation

  • This graph shows the personal status distribution split by default value (1 & 0)
In [52]:
histogram(dt,"other_debtors","default",'Default on other debtors','other debtors','Count')

Observation

  • This graph shows the other debtors distribution split by default value (0 & 1)
In [54]:
dt_new=dt[dt['default']==1]
dt_new=dt_new.groupby('default')['property'].value_counts(normalize=True)
dt_new = dt_new.mul(100).rename('Percent').reset_index()
dt_new['Percent']=dt_new['Percent'].round(decimals=2)
In [55]:
px.bar(dt_new, x='default', y='Percent', color='property',title="Default as '1' w.r.t property" 
                    ,barmode='group', text='Percent')

Observation

  • This graph shows the property type distribution for customers with default value 1
In [56]:
dt_new=dt[dt['default']==0]
dt_new=dt_new.groupby('default')['property'].value_counts(normalize=True)
dt_new = dt_new.mul(100).rename('Percent').reset_index()
dt_new['Percent']=dt_new['Percent'].round(decimals=2)
In [57]:
px.bar(dt_new, x='default', y='Percent', color='property',title="Default as '0' w.r.t property" 
                    ,barmode='group', text='Percent')

Observation

  • This graph shows the property type distribution for customers with default value 0
In [58]:
histogram(dt,"property","default",'Default on property','property','Count')

Observation

  • This graph shows the property type distribution split by default value (1 & 0)
In [60]:
dt_new=dt[dt['default']==1]
dt_new=dt_new.groupby('default')['installment_plan'].value_counts(normalize=True)
dt_new = dt_new.mul(100).rename('Percent').reset_index()
dt_new['Percent']=dt_new['Percent'].round(decimals=2)
In [61]:
px.bar(dt_new, x='default', y='Percent', color='installment_plan',title="Default as '1' w.r.t installment_plan" 
                    ,barmode='group', text='Percent')

Observation

  • This graph shows the installment plan distribution for customers with default value 1
In [62]:
dt_new=dt[dt['default']==0]
dt_new=dt_new.groupby('default')['installment_plan'].value_counts(normalize=True)
dt_new = dt_new.mul(100).rename('Percent').reset_index()
dt_new['Percent']=dt_new['Percent'].round(decimals=2)
In [63]:
px.bar(dt_new, x='default', y='Percent', color='installment_plan',title="Default as '0' w.r.t installment_plan" 
                    ,barmode='group', text='Percent')

Observation

  • This graph shows the installment plan distribution for customers with default value 0
In [64]:
histogram(dt,"installment_plan","default",'Default on installment_plan','installment_plan','Count')

Observation

  • This graph shows the installment plan distribution split by default value (1 & 0)
In [70]:
histogram(dt,"housing","default",'Default on housing','housing','Count')

Observation

  • This graph shows the housing type distribution split by default value (1 & 0)
In [72]:
dt_new=dt[dt['default']==1]
dt_new=dt_new.groupby('default')['job'].value_counts(normalize=True)
dt_new = dt_new.mul(100).rename('Percent').reset_index()
dt_new['Percent']=dt_new['Percent'].round(decimals=2)
In [73]:
px.bar(dt_new, x='default', y='Percent', color='job',title="Default as '1' w.r.t job" 
                    ,barmode='group', text='Percent')

Observation

  • This graph shows the job distribution for customers with default value 1
In [74]:
dt_new=dt[dt['default']==0]
dt_new=dt_new.groupby('default')['job'].value_counts(normalize=True)
dt_new = dt_new.mul(100).rename('Percent').reset_index()
dt_new['Percent']=dt_new['Percent'].round(decimals=2)
In [75]:
px.bar(dt_new, x='default', y='Percent', color='job',title="Default as '0' w.r.t job" 
                    ,barmode='group', text='Percent')

Observation

  • This graph shows the job distribution for customers with default value 0
In [76]:
histogram(dt,"job","default",'Default on job','Job','Count')

Observation

  • This graph shows the job distribution split by default value (1 & 0)

Feature-Selection

In [19]:
def correlation_feature(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr
In [21]:
corr_features = correlation_feature(dt, 0.7)
len(set(corr_features))
Out[21]:
0
In [22]:
corr_features
Out[22]:
set()
In [23]:
dt.describe(include='object')
Out[23]:
checking_balance credit_history purpose savings_balance employment_length personal_status other_debtors property installment_plan housing telephone foreign_worker job
count 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000
unique 4 5 10 5 5 4 3 4 3 3 2 2 4
top unknown repaid radio/tv < 100 DM 1 - 4 yrs single male none other none own none yes skilled employee
freq 394 530 280 603 339 548 907 332 814 713 596 963 630

Observation

  • The above fields are of object type. Using a label encoder, I will convert them to integer type
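The thirteen encoding cells that follow can be collapsed into a single loop over the object-dtype columns. A minimal sketch (the small `demo` frame is illustrative; the real notebook applies this to `dt`):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

demo = pd.DataFrame({'housing': ['own', 'rent', 'own'],
                     'telephone': ['yes', 'none', 'yes'],
                     'amount': [1169, 5951, 2096]})

# Encode every object-dtype column in place, using a fresh encoder per column
for col in demo.select_dtypes(include='object').columns:
    demo[col] = LabelEncoder().fit_transform(demo[col])

print(demo.dtypes)
```

Note that `LabelEncoder` assigns integers alphabetically, so the encoded values carry an arbitrary ordering; for tree models this is usually acceptable, but for linear models one-hot encoding is often preferred.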
In [31]:
label_encoder=LabelEncoder()
In [35]:
dt['checking_balance'] = label_encoder.fit_transform(dt['checking_balance'])
In [36]:
dt['credit_history'] = label_encoder.fit_transform(dt['credit_history'])
In [37]:
dt['purpose'] = label_encoder.fit_transform(dt['purpose'])
In [38]:
dt['savings_balance'] = label_encoder.fit_transform(dt['savings_balance'])
In [39]:
dt['employment_length'] = label_encoder.fit_transform(dt['employment_length'])
In [40]:
dt['personal_status'] = label_encoder.fit_transform(dt['personal_status'])
In [41]:
dt['other_debtors'] = label_encoder.fit_transform(dt['other_debtors'])
In [42]:
dt['property'] = label_encoder.fit_transform(dt['property'])
In [43]:
dt['installment_plan'] = label_encoder.fit_transform(dt['installment_plan'])
In [44]:
dt['housing'] = label_encoder.fit_transform(dt['housing'])
In [45]:
dt['telephone'] = label_encoder.fit_transform(dt['telephone'])
In [46]:
dt['foreign_worker'] = label_encoder.fit_transform(dt['foreign_worker'])
In [47]:
dt['job'] = label_encoder.fit_transform(dt['job'])
In [48]:
dt.dtypes
Out[48]:
checking_balance        int64
months_loan_duration    int64
credit_history          int64
purpose                 int64
amount                  int64
savings_balance         int32
employment_length       int32
installment_rate        int64
personal_status         int32
other_debtors           int32
residence_history       int64
property                int32
age                     int64
installment_plan        int32
housing                 int32
existing_credits        int64
default                 int64
dependents              int64
telephone               int32
foreign_worker          int32
job                     int32
dtype: object
In [156]:
dt.head()
Out[156]:
checking_balance months_loan_duration credit_history purpose amount savings_balance employment_length installment_rate personal_status other_debtors residence_history property age installment_plan housing existing_credits default dependents telephone foreign_worker job
0 1 6 0 7 1169 4 3 4 3 2 4 2 67 1 1 2 0 1 1 1 1
1 0 48 4 7 5951 2 1 2 1 2 2 2 22 1 1 1 1 1 0 1 1
2 3 12 0 4 2096 2 2 2 3 2 3 2 49 1 1 1 0 2 0 1 3
3 1 42 4 5 7882 2 2 2 3 1 4 0 45 1 0 1 0 2 0 1 1
4 1 24 1 1 4870 2 1 3 3 2 4 3 53 1 0 2 1 2 0 1 1
In [210]:
x= dt.drop(['default'],axis =1)
y = dt['default']
In [159]:
from sklearn.feature_selection import SelectKBest,f_classif
from sklearn.feature_selection import chi2
ordered_rank_features=SelectKBest(score_func=chi2,k=10)
ordered_feature=ordered_rank_features.fit(x,y)
dfscores=pd.DataFrame(ordered_feature.scores_,columns=["Score"])
dfcolumns=pd.DataFrame(x.columns)
features_rank=pd.concat([dfcolumns,dfscores],axis=1,sort=True)
features_rank.columns=['Features','Score']
features_rank
Out[159]:
Features Score
0 checking_balance 30.237955
1 months_loan_duration 5.838249
2 credit_history 5.649873
3 purpose 1.684508
4 amount 3.471087
5 savings_balance 1.529720
6 employment_length 1.251504
7 installment_rate 1.107338
8 personal_status 1.463584
9 other_debtors 0.042923
10 residence_history 0.001936
11 property 0.284452
12 age 1.158567
13 installment_plan 0.283617
14 housing 0.049109
15 existing_credits 0.571000
16 dependents 0.007680
17 telephone 0.792551
18 foreign_worker 0.249271
19 job 0.251227
In [160]:
features_rank.nlargest(15,'Score')
Out[160]:
Features Score
0 checking_balance 30.237955
1 months_loan_duration 5.838249
2 credit_history 5.649873
4 amount 3.471087
3 purpose 1.684508
5 savings_balance 1.529720
8 personal_status 1.463584
6 employment_length 1.251504
12 age 1.158567
7 installment_rate 1.107338
17 telephone 0.792551
15 existing_credits 0.571000
11 property 0.284452
13 installment_plan 0.283617
19 job 0.251227

The chi2 test above ranks the features by the strength of their association with the target variable.
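Beyond scoring, the fitted `SelectKBest` object can also reduce the feature matrix directly via `transform`. A minimal sketch on synthetic non-negative data (chi2 requires non-negative features; the random arrays stand in for the credit data):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 6))   # chi2 needs non-negative features
y = rng.integers(0, 2, size=100)

# Keep only the 3 highest-scoring columns
selector = SelectKBest(score_func=chi2, k=3).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)
```

`selector.get_support()` returns a boolean mask identifying which original columns survived, which is handy for mapping back to feature names.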

In [161]:
from sklearn.ensemble import ExtraTreesRegressor
model=ExtraTreesRegressor()
model.fit(x,y)
print(model.feature_importances_)
[0.13408369 0.09319668 0.048795   0.06003005 0.07458963 0.05458722
 0.0581437  0.04591523 0.04662744 0.03076764 0.04895868 0.0526795
 0.06176367 0.03453667 0.03292051 0.03042888 0.01592584 0.0278574
 0.00617731 0.04201528]
C:\Users\Subhasish Das\AppData\Roaming\Python\Python37\site-packages\sklearn\ensemble\forest.py:245: FutureWarning:

The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.

In [162]:
fet_import=pd.Series(model.feature_importances_,index=x.columns)
fet_import.nlargest(15).plot(kind='barh')
plt.show()
In [163]:
fet_import.nlargest(15)
Out[163]:
checking_balance        0.134084
months_loan_duration    0.093197
amount                  0.074590
age                     0.061764
purpose                 0.060030
employment_length       0.058144
savings_balance         0.054587
property                0.052679
residence_history       0.048959
credit_history          0.048795
personal_status         0.046627
installment_rate        0.045915
job                     0.042015
installment_plan        0.034537
housing                 0.032921
dtype: float64

We also applied the ExtraTrees method to rank the features by their importance with respect to the target (note that ExtraTreesRegressor was used above; for a binary target, ExtraTreesClassifier would be the more natural choice).

Model Development

  • In this section I will use different machine learning algorithms to predict defaulters on the available dataset

Performance Metric

  • I am going to use the confusion matrix and accuracy score to check how accurately my model predicts when new data is fed to it
In [190]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=42)
In [193]:
x_train.shape,y_train.shape,x_test.shape,y_test.shape
Out[193]:
((700, 20), (700,), (300, 20), (300,))

Logistic Regression

In [194]:
model_lr=LogisticRegression()
model_lr.fit(x_train,y_train)
predict_lr_tr=model_lr.predict(x_train)
predict_lr_test=model_lr.predict(x_test)
C:\Users\Subhasish Das\AppData\Roaming\Python\Python37\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning:

Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.

In [195]:
cm=confusion_matrix(y_train,predict_lr_tr)
sns.heatmap(cm,annot=True,fmt="d")
Out[195]:
<matplotlib.axes._subplots.AxesSubplot at 0x17d55a886d8>
In [196]:
cm1=confusion_matrix(y_test,predict_lr_test)
sns.heatmap(cm1,annot=True,fmt='d')
Out[196]:
<matplotlib.axes._subplots.AxesSubplot at 0x17d55a8d518>
In [197]:
print(accuracy_score(y_train,predict_lr_tr))
print(classification_report(y_train,predict_lr_tr))
0.74
              precision    recall  f1-score   support

           0       0.77      0.89      0.83       491
           1       0.60      0.38      0.47       209

    accuracy                           0.74       700
   macro avg       0.69      0.64      0.65       700
weighted avg       0.72      0.74      0.72       700

In [198]:
print(accuracy_score(y_test,predict_lr_test))
print(classification_report(y_test,predict_lr_test))
0.6833333333333333
              precision    recall  f1-score   support

           0       0.73      0.87      0.79       209
           1       0.46      0.25      0.33        91

    accuracy                           0.68       300
   macro avg       0.59      0.56      0.56       300
weighted avg       0.65      0.68      0.65       300

Observation

A higher accuracy score indicates that the model is performing well.
Here the test accuracy is only 68%, so we will try a different algorithm to train our model.
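One way to address the solver FutureWarning emitted above is to pin the solver explicitly when constructing the model. A minimal sketch on synthetic data (`make_classification` stands in for the credit dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=10, random_state=42)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Pinning the solver (and raising max_iter) silences the version-change warning
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(x_tr, y_tr)
acc = accuracy_score(y_te, model.predict(x_te))
print(acc)
```

Raising `max_iter` also avoids the separate convergence warning that lbfgs can raise on unscaled features.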

RandomForestClassifier

In [199]:
from sklearn.ensemble import RandomForestClassifier  # missing from the imports above

model_rf=RandomForestClassifier()
model_rf.fit(x_train,y_train)
predict_rf_tr=model_rf.predict(x_train)
predict_rf_test=model_rf.predict(x_test)
C:\Users\Subhasish Das\AppData\Roaming\Python\Python37\site-packages\sklearn\ensemble\forest.py:245: FutureWarning:

The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.

In [200]:
cm=confusion_matrix(y_train,predict_rf_tr)
sns.heatmap(cm,annot=True,fmt="d")
Out[200]:
<matplotlib.axes._subplots.AxesSubplot at 0x17d55b9ccf8>
In [201]:
cm1=confusion_matrix(y_test,predict_rf_test)
sns.heatmap(cm1,annot=True,fmt='d')
Out[201]:
<matplotlib.axes._subplots.AxesSubplot at 0x17d55c01160>
In [202]:
print(accuracy_score(y_train,predict_rf_tr))
print(classification_report(y_train,predict_rf_tr))
0.9814285714285714
              precision    recall  f1-score   support

           0       0.97      1.00      0.99       491
           1       1.00      0.94      0.97       209

    accuracy                           0.98       700
   macro avg       0.99      0.97      0.98       700
weighted avg       0.98      0.98      0.98       700

In [203]:
print(accuracy_score(y_test,predict_rf_test))
print(classification_report(y_test,predict_rf_test))
0.7333333333333333
              precision    recall  f1-score   support

           0       0.75      0.92      0.83       209
           1       0.62      0.31      0.41        91

    accuracy                           0.73       300
   macro avg       0.69      0.61      0.62       300
weighted avg       0.71      0.73      0.70       300

Observation

After applying random forest, we can see the accuracy score has improved, but we will try to improve it further.

Fine-Tuning the Model

In [204]:
from sklearn.model_selection import GridSearchCV
In [205]:
model_params = {
    'n_estimators': [50, 150, 250],
    'max_features': ['sqrt', 0.25, 0.5, 0.75, 1.0],
    'min_samples_split': [2, 4, 6]
}
In [206]:
rf_model = RandomForestClassifier(random_state=1)
clf = GridSearchCV(rf_model, model_params, cv=5)
model = clf.fit(x_train,y_train)
In [207]:
grid_predict=model.predict(x_test)
In [208]:
cm1=confusion_matrix(y_test,grid_predict)
sns.heatmap(cm1,annot=True,fmt='d')
Out[208]:
<matplotlib.axes._subplots.AxesSubplot at 0x17d55c81240>
In [209]:
print(accuracy_score(y_test,grid_predict))
print(classification_report(y_test,grid_predict))
0.75
              precision    recall  f1-score   support

           0       0.78      0.89      0.83       209
           1       0.63      0.42      0.50        91

    accuracy                           0.75       300
   macro avg       0.71      0.66      0.67       300
weighted avg       0.73      0.75      0.73       300

Observation

After fine-tuning the model, we don't see much improvement in accuracy.
Now we will scale the dataset in order to improve the accuracy.
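The cell below min-max scales the entire dataset before splitting, which lets test-set statistics leak into the training data. A sketch of the leak-free alternative, fitting `MinMaxScaler` on the training split only (the toy array is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

X = np.arange(20, dtype=float).reshape(10, 2)
x_tr, x_te = train_test_split(X, test_size=0.3, random_state=42)

scaler = MinMaxScaler().fit(x_tr)   # fit on training data only
x_tr_s = scaler.transform(x_tr)     # training columns mapped to [0, 1]
x_te_s = scaler.transform(x_te)     # test data reuses the training min/range
print(x_tr_s.min(), x_tr_s.max())
```

With this split-then-fit order, test values can fall slightly outside [0, 1], which is expected and harmless; the important property is that the model never sees test-set statistics.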

In [211]:
min_dt=dt.min()
range_dt=(dt-min_dt).max()
dt_scaled = (dt-min_dt)/range_dt
In [212]:
x= dt_scaled.drop(['default'],axis =1)
y = dt_scaled['default']
In [164]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=42)
In [165]:
model_lr=LogisticRegression()
model_lr.fit(x_train,y_train)
predict_lr_tr=model_lr.predict(x_train)
predict_lr_test=model_lr.predict(x_test)
C:\Users\Subhasish Das\AppData\Roaming\Python\Python37\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning:

Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.

In [167]:
cm=confusion_matrix(y_train,predict_lr_tr)
sns.heatmap(cm,annot=True,fmt="d")
Out[167]:
<matplotlib.axes._subplots.AxesSubplot at 0x17d522e1438>
In [168]:
cm1=confusion_matrix(y_test,predict_lr_test)
sns.heatmap(cm1,annot=True,fmt='d')
Out[168]:
<matplotlib.axes._subplots.AxesSubplot at 0x17d51e38898>
In [171]:
print(accuracy_score(y_train,predict_lr_tr))
print(classification_report(y_train,predict_lr_tr))
0.7457142857142857
              precision    recall  f1-score   support

         0.0       0.77      0.91      0.83       491
         1.0       0.63      0.35      0.45       209

    accuracy                           0.75       700
   macro avg       0.70      0.63      0.64       700
weighted avg       0.73      0.75      0.72       700

In [173]:
print(accuracy_score(y_test,predict_lr_test))
print(classification_report(y_test,predict_lr_test))
0.6866666666666666
              precision    recall  f1-score   support

         0.0       0.73      0.88      0.80       209
         1.0       0.47      0.25      0.33        91

    accuracy                           0.69       300
   macro avg       0.60      0.56      0.56       300
weighted avg       0.65      0.69      0.65       300

In [175]:
from sklearn.ensemble import RandomForestClassifier
In [176]:
model_rf=RandomForestClassifier()
model_rf.fit(x_train,y_train)
predict_rf_tr=model_rf.predict(x_train)
predict_rf_test=model_rf.predict(x_test)
C:\Users\Subhasish Das\AppData\Roaming\Python\Python37\site-packages\sklearn\ensemble\forest.py:245: FutureWarning:

The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.

In [177]:
cm=confusion_matrix(y_train,predict_rf_tr)
sns.heatmap(cm,annot=True,fmt="d")
Out[177]:
<matplotlib.axes._subplots.AxesSubplot at 0x17d4b70c550>
In [178]:
cm1=confusion_matrix(y_test,predict_rf_test)
sns.heatmap(cm1,annot=True,fmt='d')
Out[178]:
<matplotlib.axes._subplots.AxesSubplot at 0x17d480567f0>
In [179]:
print(accuracy_score(y_train,predict_rf_tr))
print(classification_report(y_train,predict_rf_tr))
0.9871428571428571
              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99       491
         1.0       1.00      0.96      0.98       209

    accuracy                           0.99       700
   macro avg       0.99      0.98      0.98       700
weighted avg       0.99      0.99      0.99       700

In [180]:
print(accuracy_score(y_test,predict_rf_test))
print(classification_report(y_test,predict_rf_test))
0.7633333333333333
              precision    recall  f1-score   support

         0.0       0.77      0.93      0.85       209
         1.0       0.71      0.37      0.49        91

    accuracy                           0.76       300
   macro avg       0.74      0.65      0.67       300
weighted avg       0.75      0.76      0.74       300

In [181]:
from sklearn.model_selection import GridSearchCV
In [182]:
model_params = {
    'n_estimators': [50, 150, 250],
    'max_features': ['sqrt', 0.25, 0.5, 0.75, 1.0],
    'min_samples_split': [2, 4, 6]
}
In [183]:
rf_model = RandomForestClassifier(random_state=1)
clf = GridSearchCV(rf_model, model_params, cv=5)
model = clf.fit(x_train,y_train)
In [185]:
grid_predict=model.predict(x_test)
In [186]:
cm1=confusion_matrix(y_test,grid_predict)
sns.heatmap(cm1,annot=True,fmt='d')
Out[186]:
<matplotlib.axes._subplots.AxesSubplot at 0x17d5073ca90>
In [187]:
print(accuracy_score(y_test,grid_predict))
print(classification_report(y_test,grid_predict))
0.7566666666666667
              precision    recall  f1-score   support

         0.0       0.78      0.90      0.84       209
         1.0       0.66      0.42      0.51        91

    accuracy                           0.76       300
   macro avg       0.72      0.66      0.67       300
weighted avg       0.74      0.76      0.74       300

Observation

  • After scaling the dataset, we see an improvement in the accuracy score of our Random forest model
  • We can conclude that the Random forest model trained on the scaled data is the best model for this dataset
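As a final step, the winning hyper-parameters and cross-validated score can be read back from the fitted GridSearchCV object. A minimal sketch on synthetic data (the parameter grid mirrors the cells above, but `make_classification` is a stand-in for the credit data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=1)
params = {'n_estimators': [50, 150], 'min_samples_split': [2, 4]}

clf = GridSearchCV(RandomForestClassifier(random_state=1), params, cv=3)
clf.fit(X, y)

print(clf.best_params_)              # the winning combination
print(round(clf.best_score_, 3))     # its mean cross-validated accuracy
```

`clf.best_estimator_` holds the refit model, which is what `model.predict(x_test)` used implicitly in the cells above.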
In [ ]: